CUBE CONNECT Edition Help

NCORES case study

Here we examine the use of the NCORES parameter on the three test cases described above and explain how the program achieves the performance results observed in each case. Timing results are provided below in Table 3.

Case 1: Medium Size Static Problem (see Table 3) — For this test case, we find that increasing the NCORES parameter does little to speed up the program, but why? To explain this, the user must understand that only the parallel computations within each iteration of the conjugate gradient solver are affected by the NCORES parameter. In this case, the program spends a substantial portion of its time in the sequential setup portions of the code (loading screenline counts, processing ICP files, reading/writing matrix files, etc.), and performs only a small number of iterations of the estimation loop before converging for two of the three user classes. Because so little time is spent in the estimation loop, increasing the NCORES parameter offers little room for improvement: this particular problem simply does not spend enough time in the parallel sections of the code. This example also makes clear another important aspect of parallel computing: there will always come a point where adding cores no longer provides a performance benefit, and adding cores beyond that limit will actually decrease performance. Note that the 8-core time for this example is actually longer than the 4-core time. Somewhere between the 4- and 8-core mark, the cost of communication overhead between the cores comes to exceed the benefit of parallel computation.
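
Case 1's behavior follows the classic Amdahl's law pattern with an added communication cost. The Python sketch below is illustrative only: the serial fraction and per-core overhead are assumed values chosen to mimic the shape of the Case 1 results, not measured properties of the program.

    # Simple model of run time versus core count: a fixed sequential
    # fraction (setup, I/O), a perfectly parallel fraction, and a
    # communication overhead that grows with the core count.
    # All constants are assumptions for illustration.

    def modeled_speedup(n_cores, serial_frac=0.80, overhead_per_core=0.02):
        """Speedup relative to one core under the simple model above."""
        parallel_frac = 1.0 - serial_frac
        # Normalized run time: sequential part + parallel part spread
        # over the cores + communication overhead.
        time = (serial_frac
                + parallel_frac / n_cores
                + overhead_per_core * (n_cores - 1))
        return 1.0 / time

    for n in (1, 2, 4, 8):
        print(f"{n} core(s): speedup = {modeled_speedup(n):.2f}")

    # With these assumed constants, the speedup peaks near 4 cores and
    # falls off at 8, mirroring the pattern described for Case 1.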

Case 2: Medium Size Dynamic Problem (see Table 3) — This test case provides a nice example of the benefits of parallel computation: increasing the NCORES parameter consistently results in faster execution times, while at the same time illustrating the important principle of diminishing returns. In parallel computing, for problems which are not embarrassingly parallel (an embarrassingly parallel problem is one in which essentially everything is parallel, with little or no need for sequential code sections), we find that parallel efficiency decreases as the number of cores increases. We define parallel efficiency as the speedup factor divided by the number of cores used; perfect parallel efficiency gives a value of 1.0, which corresponds to linear scaling. Indeed, for this problem the parallel efficiency decreases from 0.875 on two cores to 0.455 on eight cores. The lesson to take from this example is that when increasing the number of cores for a given problem using Analyst Drive, there will be a diminishing return on the performance gains even when the problem is a good candidate for parallel computation.
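
The definitions above are straightforward to apply to timings such as those in Table 3. A minimal sketch in Python (the function names are ours, and the single-core time is an arbitrary placeholder):

    def speedup(t_one_core, t_n_cores):
        """Speedup factor: single-core run time divided by n-core run time."""
        return t_one_core / t_n_cores

    def parallel_efficiency(t_one_core, t_n_cores, n_cores):
        """Speedup divided by the cores used; 1.0 is perfect (linear) scaling."""
        return speedup(t_one_core, t_n_cores) / n_cores

    # The Case 2 efficiencies quoted above imply speedups of
    # 0.875 * 2 = 1.75 on two cores and 0.455 * 8 = 3.64 on eight.
    # For any single-core time t1:
    t1 = 100.0  # placeholder value
    print(f"{parallel_efficiency(t1, t1 / 1.75, 2):.3f}")  # -> 0.875
    print(f"{parallel_efficiency(t1, t1 / 3.64, 8):.3f}")  # -> 0.455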

Case 3: Large Size Static Problem (see Table 3) — Much like the results found in Case 2 above, this case provides another good example of the performance increases afforded by parallel computation. Here we find greater parallel efficiency, but again see that efficiency decreases from two to eight cores as in Case 2, though at a slower rate. Calculating the rate of decrease in parallel efficiency from two to eight cores as a slope over the 6-core span, we find that the efficiency of Case 2 decreases at a rate of 0.07 per core while that of this case decreases at a lower 0.05875 per core (this calculation is reproduced in the sketch following the list below). Why is this so? This problem spends the vast majority of its time within parallel sections of the code owing to its sheer size; thus, it produces greater parallel efficiency over the core range tested. This leads us to two generalizations about parallel computing with Analyst Drive:

1. The bigger the problem, the better the parallel efficiency, and

2. The bigger the problem, the slower the decrease in parallel efficiency as the number of cores increases.
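
The slope calculation referenced above can be reproduced directly from the quoted figures. A short sketch (Case 3's endpoint efficiencies are not quoted in the text, so only its implied total drop is shown):

    def efficiency_decrease_rate(eff_2_cores, eff_8_cores):
        """Average drop in parallel efficiency per added core over the
        6-core span from two to eight cores."""
        return (eff_2_cores - eff_8_cores) / (8 - 2)

    # Case 2, using the efficiencies quoted in the text:
    print(f"{efficiency_decrease_rate(0.875, 0.455):.3f}")  # -> 0.070

    # Case 3's stated rate of 0.05875 per core implies a total efficiency
    # drop of 0.05875 * 6 = 0.3525 between two and eight cores.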

Table 3: Performance comparison for the example problems with varying NCORES parameter. Case 1: medium-size static estimation problem with 3 user classes, 4,681 zones, and 954 screenline counts. Case 2: medium-size dynamic estimation problem with 2 user classes, 57 zones, 12 time intervals, 44 screenline counts (class 1), and 42 screenline counts (class 2). Case 3: very large static estimation problem with 9 user classes, 10,000+ screenline counts, and 6,000+ zones. The top number of each entry is the total run time in seconds; the bottom number is the speedup, expressed as the multiple over the single-core (base) time.

To briefly summarize the cases: using the NCORES parameter to allocate available CPU cores can greatly reduce run times, and in general provides bigger performance boosts with larger data sets. There is a diminishing performance return when increasing the value of NCORES, due to the sequential portions of the program, and for every problem there will come a point where increasing the number of cores no longer provides a performance increase because of the cost of communication overhead.